App Classification EDA¶

The goal of this project is to develop a machine learning model that identifies amphibian and insect species (the dataset used here also covers reptiles and arachnids) and provides information about their characteristics and potential danger to humans. Accurate identification can support conservation efforts and public safety. The first step is to gather a large dataset of images and species labels; the model is then trained on that data. Ethical considerations are important throughout. This notebook covers the first step: exploring and cleaning the observation data so that it can be used for analysis and model training.

1. Data Understanding¶

1. Import libraries and dataset¶

I will begin by importing the necessary libraries.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Load and inspect the dataset¶

The dataset used for this project is sourced from https://observation.org/, a platform where users publish their observations of various species together with images. Note that the image URLs in the data point to iNaturalist hosts (inaturalist-open-data.s3.amazonaws.com and static.inaturalist.org). Because the data is community-driven, it covers a diverse range of species and images, which makes it a good candidate for training a model that identifies amphibian and insect species. Whether it actually improves model accuracy has to be verified during training.

In [ ]:
# Load the two datasets (both CSV files are in the Data directory)
file_path1 = 'Data/observationsInsectSpider.csv'
file_path2 = 'Data/observationsSnakesAmphibia.csv'
df1 = pd.read_csv(file_path1)
df2 = pd.read_csv(file_path2)

# concatenate the two datasets
observation_df = pd.concat([df1, df2], ignore_index=True)

# sort the resulting dataframe by the "id" column
observation_df = observation_df.sort_values(by="id")
In [ ]:
observation_df
Out[ ]:
id observed_on_string observed_on time_observed_at time_zone user_id user_login user_name created_at updated_at ... geoprivacy taxon_geoprivacy coordinates_obscured positioning_method positioning_device species_guess scientific_name common_name iconic_taxon_name taxon_id
199440 39 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Taricha torosa California Newt Amphibia 27818
199441 40 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Taricha torosa California Newt Amphibia 27818
199442 80 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Callisaurus draconoides Zebra-tailed Lizard Reptilia 36080
0 203 March 18, 2008 12:00 2008-03-18 2008-03-18 19:00:00 UTC Pacific Time (US & Canada) 49.0 alan99 NaN 2008-05-05 08:32:40 UTC 2023-03-06 04:56:46 UTC ... NaN NaN False NaN NaN Ranchman's Tiger Moth Arctia virginalis Ranchman's Tiger Moth Insecta 626880
1 523 2008-07-13 2008-07-13 NaN Eastern Time (US & Canada) 1.0 kueda Ken-ichi Ueda 2008-07-26 00:22:46 UTC 2022-12-25 18:17:48 UTC ... NaN NaN False NaN NaN Forest Tent Caterpillar Moth Malacosoma disstria Forest Tent Caterpillar Moth Insecta 81663
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
303676 156384671 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Duttaphrynus melanostictus Asian Common Toad Amphibia 62345
303677 156387550 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Bufo bufo Gewone Pad Amphibia 326296
303678 156391245 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Crotalus adamanteus Eastern Diamondback Rattlesnake Reptilia 53491
303679 156392005 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Hyla arborea Boomkikker Amphibia 424147
303680 156400272 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Zamenis scalaris Trapslang Reptilia 540328

303681 rows × 39 columns
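The index in the output above starts at 199440 because `sort_values` keeps the labels assigned by `pd.concat`. The following is a minimal sketch, on made-up toy frames rather than the real CSV files, of how the index could be reset after sorting and how duplicate observation ids across the two files could be checked. The names `df_a`, `df_b`, and `combined` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two CSV exports (illustrative values only)
df_a = pd.DataFrame({
    "id": [203, 523],
    "scientific_name": ["Arctia virginalis", "Malacosoma disstria"],
})
df_b = pd.DataFrame({
    "id": [39, 40],
    "scientific_name": ["Taricha torosa", "Taricha torosa"],
})

combined = pd.concat([df_a, df_b], ignore_index=True)

# sort_values keeps the old labels; reset_index makes them sequential again
combined = combined.sort_values(by="id").reset_index(drop=True)

# An observation id is expected to appear only once across both files
n_duplicate_ids = int(combined["id"].duplicated().sum())

print(combined)
print("duplicate ids:", n_duplicate_ids)
```

Resetting the index is optional; it only makes positional inspection of the sorted frame easier.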

3. Check for missing values¶

In [ ]:
observation_df.isna().sum()
Out[ ]:
id                                       0
observed_on_string                  104528
observed_on                         104532
time_observed_at                    115994
time_zone                           104242
user_id                             104241
user_login                          104241
user_name                           155945
created_at                          104241
updated_at                          104241
quality_grade                       104241
license                             150394
url                                 104241
image_url                             1032
sound_url                           303265
tag_list                            288257
description                         233379
num_identification_agreements       104241
num_identification_disagreements    104241
captive_cultivated                  104241
oauth_application_id                219374
place_guess                         104269
latitude                            104241
longitude                           104241
positional_accuracy                 143310
private_place_guess                 303681
private_latitude                    303681
private_longitude                   303681
public_positional_accuracy          143252
geoprivacy                          303681
taxon_geoprivacy                    280312
coordinates_obscured                104241
positioning_method                  273362
positioning_device                  271583
species_guess                       106951
scientific_name                          1
common_name                          27771
iconic_taxon_name                        0
taxon_id                                 0
dtype: int64
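Many columns report exactly 104,241 missing values. That number equals the rows that follow the first 199,440 (303,681 − 199,440 = 104,241), and the rows displayed with index 199440 and above are the ones that are almost entirely NaN. This suggests that the second file simply lacks those metadata fields. The sketch below shows one way to check this on toy data: it computes the share of missing values per column and the number of missing values per source file. The `source` column and the toy values are assumptions for illustration; the real notebook does not track the source file.

```python
import numpy as np
import pandas as pd

# Toy frame: rows from "file2" lack the user_id metadata (illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "user_id": [49.0, 1.0, np.nan, np.nan],
    "scientific_name": ["a", "b", "c", "d"],
    "source": ["file1", "file1", "file2", "file2"],  # hypothetical helper column
})

values = toy.drop(columns="source")

# Share of missing values per column, largest first
missing_share = values.isna().mean().sort_values(ascending=False)

# Number of missing values per column, split by source file
missing_by_source = values.isna().groupby(toy["source"]).sum()

print(missing_share)
print(missing_by_source)
```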

4. Check for unique values¶

In [ ]:
# Check how many unique species are in the dataset
observation_df['scientific_name'].nunique()
Out[ ]:
21196

There are 21,196 distinct scientific names in this dataset. Since the dataset has 303,681 rows, many species are observed more than once (roughly 14 observations per name on average).

In [ ]:
observation_df['iconic_taxon_name'].unique()
Out[ ]:
array(['Amphibia', 'Reptilia', 'Insecta', 'Arachnida'], dtype=object)

The dataset has 4 different categories:

  • Amphibia
  • Reptilia
  • Insecta
  • Arachnida

2. Data Cleaning¶

In this step, I will remove the columns that are not needed for this application. Then, I will check the reduced dataset for missing values and remove rows that cannot be used as labeled examples.

In [ ]:
observation_df.head()
Out[ ]:
id observed_on_string observed_on time_observed_at time_zone user_id user_login user_name created_at updated_at ... geoprivacy taxon_geoprivacy coordinates_obscured positioning_method positioning_device species_guess scientific_name common_name iconic_taxon_name taxon_id
199440 39 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Taricha torosa California Newt Amphibia 27818
199441 40 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Taricha torosa California Newt Amphibia 27818
199442 80 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN Callisaurus draconoides Zebra-tailed Lizard Reptilia 36080
0 203 March 18, 2008 12:00 2008-03-18 2008-03-18 19:00:00 UTC Pacific Time (US & Canada) 49.0 alan99 NaN 2008-05-05 08:32:40 UTC 2023-03-06 04:56:46 UTC ... NaN NaN False NaN NaN Ranchman's Tiger Moth Arctia virginalis Ranchman's Tiger Moth Insecta 626880
1 523 2008-07-13 2008-07-13 NaN Eastern Time (US & Canada) 1.0 kueda Ken-ichi Ueda 2008-07-26 00:22:46 UTC 2022-12-25 18:17:48 UTC ... NaN NaN False NaN NaN Forest Tent Caterpillar Moth Malacosoma disstria Forest Tent Caterpillar Moth Insecta 81663

5 rows × 39 columns

1. Remove Columns¶

The columns I want to keep in this dataset are:

  • 'id': This column uniquely identifies each observation.
  • 'scientific_name': This column contains the species names, which will be the target labels for the machine learning model.
  • 'common_name': This column might be useful to provide more user-friendly information about the species.
  • 'iconic_taxon_name': This column can be used to filter and focus on amphibians and/or insects.
  • 'image_url': This column can be used to download the images for the dataset.

The other columns can be removed.

In [ ]:
columns_to_drop = [
    'observed_on_string', 'observed_on', 'time_observed_at', 'time_zone',
    'user_id', 'user_login', 'user_name', 'created_at', 'updated_at',
    'quality_grade', 'license', 'url', 'sound_url', 'tag_list',
    'description', 'num_identification_agreements',
    'num_identification_disagreements', 'captive_cultivated', 'oauth_application_id',
    'place_guess', 'latitude', 'longitude', 'positional_accuracy',
    'public_positional_accuracy', 'geoprivacy', 'taxon_geoprivacy',
    'coordinates_obscured', 'positioning_method', 'positioning_device',
    'species_guess', 'private_place_guess', 'private_latitude', 'private_longitude', 'taxon_id'
]

data_filtered = observation_df.drop(columns=columns_to_drop)
In [ ]:
data_filtered
Out[ ]:
id image_url scientific_name common_name iconic_taxon_name
199440 39 https://inaturalist-open-data.s3.amazonaws.com... Taricha torosa California Newt Amphibia
199441 40 https://inaturalist-open-data.s3.amazonaws.com... Taricha torosa California Newt Amphibia
199442 80 https://inaturalist-open-data.s3.amazonaws.com... Callisaurus draconoides Zebra-tailed Lizard Reptilia
0 203 http://static.inaturalist.org/photos/132/mediu... Arctia virginalis Ranchman's Tiger Moth Insecta
1 523 https://inaturalist-open-data.s3.amazonaws.com... Malacosoma disstria Forest Tent Caterpillar Moth Insecta
... ... ... ... ... ...
303676 156384671 https://static.inaturalist.org/photos/27041880... Duttaphrynus melanostictus Asian Common Toad Amphibia
303677 156387550 https://inaturalist-open-data.s3.amazonaws.com... Bufo bufo Gewone Pad Amphibia
303678 156391245 https://static.inaturalist.org/photos/27043003... Crotalus adamanteus Eastern Diamondback Rattlesnake Reptilia
303679 156392005 https://inaturalist-open-data.s3.amazonaws.com... Hyla arborea Boomkikker Amphibia
303680 156400272 https://inaturalist-open-data.s3.amazonaws.com... Zamenis scalaris Trapslang Reptilia

303681 rows × 5 columns
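Since only five columns are kept, an alternative to listing every column to drop is to select the columns to keep. The sketch below shows this on a toy one-row frame; `columns_to_keep`, `toy`, and `kept` are hypothetical names, and the URL is a placeholder.

```python
import pandas as pd

# The five columns named in the list above
columns_to_keep = [
    "id", "image_url", "scientific_name", "common_name", "iconic_taxon_name",
]

# Toy frame with a few extra columns (illustrative values only)
toy = pd.DataFrame({
    "id": [39],
    "user_login": [None],
    "image_url": ["https://example.org/photo.jpg"],  # placeholder URL
    "scientific_name": ["Taricha torosa"],
    "common_name": ["California Newt"],
    "iconic_taxon_name": ["Amphibia"],
    "taxon_id": [27818],
})

# Selecting by name also fixes the column order; copy() avoids a view of toy
kept = toy[columns_to_keep].copy()
print(kept.columns.tolist())
```

Selecting by name raises a `KeyError` if an expected column is missing from the export, which can be a useful early check.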

2. Look at the new dataset¶

Given that the dataset is considerably smaller now, I intend to review it once more to identify any missing or duplicate values.

In [ ]:
data_filtered.isna().sum()
Out[ ]:
id                       0
image_url             1032
scientific_name          1
common_name          27771
iconic_taxon_name        0
dtype: int64

One row is missing its scientific_name. Let's look at this row to check whether it can be removed.

In [ ]:
missing_scientific_name = data_filtered[data_filtered['scientific_name'].isna()]
print(missing_scientific_name)
               id                                          image_url  \
197083  152245435  https://static.inaturalist.org/photos/26286780...   

       scientific_name common_name iconic_taxon_name  
197083             NaN         NaN           Insecta  

This row has neither a scientific name nor a common name, so it cannot serve as a labeled example and can be removed.

In [ ]:
# remove the empty row
data_filtered = data_filtered.dropna(subset=['scientific_name'])

data_filtered.isna().sum()
Out[ ]:
id                       0
image_url             1032
scientific_name          0
common_name          27770
iconic_taxon_name        0
dtype: int64
In [ ]:
# Check how many unique species are in the dataset
data_filtered['scientific_name'].nunique()
Out[ ]:
21196
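The 1,032 rows without an image_url are still in the dataset at this point. If the images are the model input, those rows probably cannot be used either. The following sketch, on toy data with placeholder URLs, shows how they could be dropped in the same way as the row without a scientific name. Treating these rows as unusable is my assumption, not a step performed in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with placeholder URLs (illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "image_url": [
        "https://example.org/a.jpg",
        np.nan,
        "https://example.org/c.jpg",
    ],
    "scientific_name": ["Bufo bufo", "Hyla arborea", "Bufo bufo"],
})

# Keep only observations that have an image to download
with_images = toy.dropna(subset=["image_url"])
n_dropped = len(toy) - len(with_images)

print(f"dropped {n_dropped} row(s) without an image")
```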

3. Visualize the data¶

In the last part of this EDA, I'm going to visualize the data.

1. Visualize the distribution of 'iconic_taxon_name'¶

In [ ]:
import plotly.graph_objs as go
import plotly.io as pio

# Enable notebook renderer
pio.renderers.default = 'notebook'

# Counting how many unique species per category
insecta_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Insecta']['scientific_name'].nunique()
arachnida_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Arachnida']['scientific_name'].nunique()
amphibia_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Amphibia']['scientific_name'].nunique()
reptilia_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Reptilia']['scientific_name'].nunique()

# Create a list of tuples with taxon groups and their unique species counts
taxon_counts = [
    ('Insecta', insecta_unique_count),
    ('Arachnida', arachnida_unique_count),
    ('Amphibia', amphibia_unique_count),
    ('Reptilia', reptilia_unique_count)
]

# Sort the list in descending order based on unique species count
taxon_counts.sort(key=lambda x: x[1], reverse=True)

# Separate the sorted taxon groups and species counts into separate lists
sorted_taxon_groups, sorted_species_counts = zip(*taxon_counts)

# Create a bar plot
fig = go.Figure()

fig.add_trace(go.Bar(
    x=list(sorted_taxon_groups),
    y=list(sorted_species_counts),
    text=list(sorted_species_counts),
    textposition='auto',
    marker_color=['#377eb8', '#e41a1c', '#4daf4a', '#984ea3']  # One distinct color per bar
))

# Customize the plot layout
fig.update_layout(
    title='Unique Insecta, Arachnida, Amphibia, and Reptilia Species',
    xaxis_title='Taxon Group',
    yaxis_title='Unique Species Count'
)

# Display the interactive plot
fig.show()

print(f"Number of unique Amphibia species: {amphibia_unique_count}")
print(f"Number of unique Reptilia species: {reptilia_unique_count}")
print(f"Number of unique Insecta species: {insecta_unique_count}")
print(f"Number of unique Arachnida species: {arachnida_unique_count}")
Number of unique Amphibia species: 1861
Number of unique Reptilia species: 2364
Number of unique Insecta species: 15444
Number of unique Arachnida species: 1527
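The four per-group counts can also be computed in one step with `groupby`. This is a minimal sketch on toy data; `species_per_taxon` is a hypothetical name.

```python
import pandas as pd

# Toy observations (illustrative only)
toy = pd.DataFrame({
    "iconic_taxon_name": ["Insecta", "Insecta", "Insecta", "Amphibia", "Amphibia"],
    "scientific_name": [
        "Arctia virginalis", "Arctia virginalis", "Malacosoma disstria",
        "Bufo bufo", "Bufo bufo",
    ],
})

# Unique scientific names per taxon group, largest group first
species_per_taxon = (
    toy.groupby("iconic_taxon_name")["scientific_name"]
    .nunique()
    .sort_values(ascending=False)
)

print(species_per_taxon)
```

The resulting Series is already sorted, so its index and values could be passed directly to the bar plot.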

2. Visualize the distribution of observations across the top 10 most observed species¶

In [ ]:
N = 10
# Use N here so the plot title and the number of bars stay in sync
top_species = data_filtered['scientific_name'].value_counts().head(N)
sns.barplot(x=top_species.index, y=top_species.values)
plt.xticks(rotation=55)
plt.ylabel('Number of Observations')
plt.title(f'Top {N} Most Common Species')
plt.show()

3. Visualize the distribution of observations with and without common name¶

In [ ]:
# Create a copy of the filtered DataFrame
data_filtered_copy = data_filtered.copy()

# Create a new column 'has_common_name' in the copied DataFrame
data_filtered_copy['has_common_name'] = ~data_filtered_copy['common_name'].isna()

# Create the count plot
sns.countplot(x='has_common_name', data=data_filtered_copy)
plt.title('Observations with and without Common Name')
plt.show()
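The count plot shows absolute numbers. From the missing-value counts above (27,770 missing common names out of 303,680 rows), about 9% of the observations have no common name. The sketch below shows how that share could be computed directly, using toy data.

```python
import numpy as np
import pandas as pd

# Toy data: one of four observations lacks a common name (illustrative only)
toy = pd.DataFrame({
    "common_name": ["California Newt", np.nan, "Asian Common Toad", "Gewone Pad"],
})

# The mean of a boolean Series is the share of True values
share_with_common_name = toy["common_name"].notna().mean()

print(f"{share_with_common_name:.1%} of observations have a common name")
```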

4. Visualize the top 10 most observed species per taxon group¶

In [ ]:
import plotly.graph_objs as go

def get_top_species(data, taxon, N=10):
    data_taxon = data[data['iconic_taxon_name'] == taxon]
    top_species = data_taxon['scientific_name'].value_counts().head(N)
    return top_species

top_insect_species = get_top_species(data_filtered, 'Insecta')
top_arachnida_species = get_top_species(data_filtered, 'Arachnida')
top_amphibia_species = get_top_species(data_filtered, 'Amphibia')
top_reptilia_species = get_top_species(data_filtered, 'Reptilia')

fig = go.Figure()

fig.add_trace(go.Bar(x=top_insect_species.index,
                     y=top_insect_species.values,
                     name='Insecta',
                     marker_color='rgb(58, 200, 225)'))

fig.add_trace(go.Bar(x=top_arachnida_species.index,
                     y=top_arachnida_species.values,
                     name='Arachnida',
                     marker_color='rgb(58, 71, 80)'))

fig.add_trace(go.Bar(x=top_amphibia_species.index,
                     y=top_amphibia_species.values,
                     name='Amphibia',
                     marker_color='rgb(204, 204, 0)'))

fig.add_trace(go.Bar(x=top_reptilia_species.index,
                     y=top_reptilia_species.values,
                     name='Reptilia',
                     marker_color='rgb(229, 121, 36)'))

fig.update_layout(
    title='Top 10 Most Common Species for Insecta, Arachnida, Amphibia, and Reptilia',
    xaxis=dict(title='Species'),
    yaxis=dict(title='Number of Observations'),
    legend=dict(x=0, y=1.0),
    barmode='group',
    bargap=0.15,
    bargroupgap=0.1,
    plot_bgcolor='white',
    xaxis_tickangle=-45
)

fig.show()
In [ ]:
data_filtered
Out[ ]:
id image_url scientific_name common_name iconic_taxon_name
199440 39 https://inaturalist-open-data.s3.amazonaws.com... Taricha torosa California Newt Amphibia
199441 40 https://inaturalist-open-data.s3.amazonaws.com... Taricha torosa California Newt Amphibia
199442 80 https://inaturalist-open-data.s3.amazonaws.com... Callisaurus draconoides Zebra-tailed Lizard Reptilia
0 203 http://static.inaturalist.org/photos/132/mediu... Arctia virginalis Ranchman's Tiger Moth Insecta
1 523 https://inaturalist-open-data.s3.amazonaws.com... Malacosoma disstria Forest Tent Caterpillar Moth Insecta
... ... ... ... ... ...
303676 156384671 https://static.inaturalist.org/photos/27041880... Duttaphrynus melanostictus Asian Common Toad Amphibia
303677 156387550 https://inaturalist-open-data.s3.amazonaws.com... Bufo bufo Gewone Pad Amphibia
303678 156391245 https://static.inaturalist.org/photos/27043003... Crotalus adamanteus Eastern Diamondback Rattlesnake Reptilia
303679 156392005 https://inaturalist-open-data.s3.amazonaws.com... Hyla arborea Boomkikker Amphibia
303680 156400272 https://inaturalist-open-data.s3.amazonaws.com... Zamenis scalaris Trapslang Reptilia

303680 rows × 5 columns

Conclusion¶

In this analysis, we performed an Exploratory Data Analysis (EDA) on a dataset containing observations of Insecta, Arachnida, Amphibia, and Reptilia. We aimed to explore the data and understand the distribution of species and the availability of common names in the dataset.

First, we reduced the dataset to the five columns relevant for this application. The data already contains only the four taxon groups of interest: insects, arachnids, amphibians, and reptiles. We checked for missing values and removed the one row with a missing scientific name.

Next, we created a series of visualizations to better understand the data:

  1. Number of unique species per taxon group (Insecta, Arachnida, Amphibia, and Reptilia).
  2. Top 10 most common species.
  3. Distribution of observations with and without a common name.

These visualizations helped us gain insights into the taxonomic diversity of the dataset and identify potential imbalances or patterns that could affect a machine learning model's performance.
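One way to quantify such an imbalance is to count the observations per species and look at how many species fall below a minimum. The sketch below uses toy data; the threshold `min_observations` is a hypothetical value, not one chosen in this project.

```python
import pandas as pd

# Toy data: 5, 2 and 1 observations for three species (illustrative only)
toy = pd.DataFrame({
    "scientific_name": ["Bufo bufo"] * 5 + ["Hyla arborea"] * 2 + ["Zamenis scalaris"],
})

counts = toy["scientific_name"].value_counts()

min_observations = 2  # hypothetical threshold for "enough training images"
rare = counts[counts < min_observations]

# Ratio between the most and the least observed species
imbalance_ratio = counts.max() / counts.min()

print(f"{len(rare)} of {len(counts)} species have fewer than {min_observations} observations")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```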

Finally, we created an interactive plot that displays the top 10 most common species for Insecta, Arachnida, Amphibia, and Reptilia. The interactive plot allows for easy exploration and comparison of the most common species in all four taxonomic groups.

In conclusion, the EDA we performed provided valuable insights into the dataset's structure, distribution, and potential issues. These insights can inform the data preparation process and guide the development of a machine learning model to identify different species of insects, arachnids, amphibians, and reptiles, as well as provide information about their characteristics and potential danger level to humans.

In [ ]:
# Specify the file path for the new CSV file
new_file_path = 'Data/observation.csv'

# Save the updated DataFrame to the new CSV file
data_filtered.to_csv(new_file_path, index=False)
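A quick round-trip check can confirm that the saved file reloads with the same columns. This sketch writes a toy frame to a temporary directory instead of 'Data/observation.csv', so it does not touch the real output.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one missing common name (illustrative only)
toy = pd.DataFrame({
    "id": [39, 40],
    "scientific_name": ["Taricha torosa", "Bufo bufo"],
    "common_name": ["California Newt", None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "observation.csv")
    # index=False avoids writing the row labels as an extra column
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.isna().sum())
```

Missing values are written as empty fields and read back as NaN, so the missing-value counts should match before and after saving.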